1. INTRODUCTION

Multiple-Input Multiple-Output (MIMO) techniques play a vital role in present-day wireless
communication systems, enabling higher data rates, improved reliability, and higher throughput and
capacity [1]. In [2], different channel models for MIMO were simulated. Although the performance
improvement is considerable, the complexity of the receiver increases significantly. QR decomposition
is of critical importance in reducing the complexity of MIMO detection algorithms. It decomposes the
channel matrix H into an orthogonal matrix Q and an upper triangular matrix R. The QR decomposition is
performed every time the channel impulse response changes significantly. Because unitary
transformations do not amplify the noise, QR decomposition greatly reduces the noise enhancement
problem, thus minimizing the chance of erroneous detection arising from noise. Successive Interference
Cancellation (SIC), V-BLAST and tree-search-based detection algorithms use these matrices to detect the
received complex-valued signal vector.

QR decomposition can be computed using several methods: Householder transformations, Givens
rotations and Gram-Schmidt orthogonalization. These transformations can be simplified with the CORDIC
algorithm [3], leading to a low-complexity solution for hardware realization. In a small-scale MIMO
detector, the channel matrix is small and QR decomposition requires a lower processing speed, so
parallel systolic array processors, even with reduced dimensions, are not justified in such systems.
In [4], Lin discussed QR decomposition based on Givens rotation with the CORDIC algorithm. Hwang in [5]
implemented complex QR factorization based on Givens rotation for real-time detection of MIMO signals
and devised several hardware reduction techniques, such as constant multiplier sharing and look-up
table elimination for CORDIC modules. High-speed hardware multipliers were evaluated in [6]. Nazar in
[7] discussed a low-complexity hardware architecture for QR decomposition.

Therefore, the objective of this paper is to propose a QR Decomposition (QRD) module for square
channel matrices based on the Givens rotation method, which allows low-complexity decomposition of the
channel matrices by reducing the number of computations. The Givens rotation of the channel matrix is
performed using a CORDIC module designed with the Xilinx System Generator block set to reduce the
hardware complexity and power consumption. The proposed QRD module can be used as a pre-processing
unit in a MIMO detection unit.

In Section 2, the block diagram of the MIMO system is explained. Section 3 outlines the basic CORDIC
algorithm, its operating modes, and the pipelined and scaled implementation using Xilinx System
Generator. Section 4 describes the proposed QRD module and its implementation. VLSI architecture
implementation results are discussed in Section 5. Section 6 concludes the paper.


2. RESEARCH METHOD 
Consider a MIMO system with M transmit antennas and N receive antennas, with the assumption
that N>M. The complex baseband equivalent model for the considered flat fading MIMO wireless channel
yields an N-dimensional received vector y = [y_1, ..., y_N]^T given by the equation

y = Hs + n    (1)

where H is the N x M complex-valued channel matrix, s = [s_1, ..., s_M]^T is the transmitted signal
vector and n is the additive noise vector. Figure 1 shows the MIMO detection unit containing the
pre-processing unit and the sphere detector unit. Most detection algorithms for MIMO systems start by
decomposing the channel matrix H into a unitary matrix Q and an upper triangular matrix R. The
pre-processing unit performs the QR decomposition in the detector; the received signal vector is then
rotated by the Q matrix as y' = Q^H y and fed to the detector unit.

Figure 1. MIMO Detection unit (blocks: sphere detector, symbol set; H -> channel matrix, Y -> received
signal vector, Q -> orthogonal matrix, R -> upper triangular matrix, PED -> partial Euclidean distance)
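
The pre-processing step can be illustrated with a small numerical sketch. It is not the hardware method of this paper: for brevity it uses classical Gram-Schmidt in plain Python instead of CORDIC-based Givens rotations, assumes a noise-free channel, and all function and variable names are hypothetical. In that setting the rotated vector y' = Q^H y should equal R s, which is the triangular system the detector works on.

```python
import random

def qr_gram_schmidt(h):
    """Thin QR of a complex matrix (list of rows) by classical Gram-Schmidt.

    Returns (q_cols, r), where q_cols[j] is column j of Q and r is M x M upper triangular.
    """
    n, m = len(h), len(h[0])
    cols = [[h[row][c] for row in range(n)] for c in range(m)]
    q_cols = []
    r = [[0j] * m for _ in range(m)]
    for j, v in enumerate(cols):
        w = list(v)
        for i, q in enumerate(q_cols):
            # Projection coefficient of column j onto the i-th orthonormal column
            r[i][j] = sum(qk.conjugate() * vk for qk, vk in zip(q, v))
            w = [wk - r[i][j] * qk for wk, qk in zip(w, q)]
        norm = sum(abs(wk) ** 2 for wk in w) ** 0.5
        r[j][j] = complex(norm)
        q_cols.append([wk / norm for wk in w])
    return q_cols, r

def preprocess(q_cols, y):
    """Rotate the received vector: y' = Q^H y."""
    return [sum(qk.conjugate() * yk for qk, yk in zip(q, y)) for q in q_cols]

random.seed(0)
N, M = 3, 2  # receive and transmit antennas (illustrative sizes)
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)] for _ in range(N)]
s = [1 + 1j, -1 + 1j]                                           # example symbols
y = [sum(H[row][c] * s[c] for c in range(M)) for row in range(N)]  # y = Hs (no noise)

Q_cols, R = qr_gram_schmidt(H)
y_rot = preprocess(Q_cols, y)
Rs = [sum(R[i][j] * s[j] for j in range(M)) for i in range(M)]
```

With noise present, y' would equal R s plus the rotated noise Q^H n, whose norm is the same as that of the component of n in the column space of H.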


3. CORDIC ALGORITHM 

The CORDIC (COordinate Rotation DIgital Computer) algorithm can compute trigonometric, logarithmic
and hyperbolic functions, real and complex multiplications, square root, division, eigenvalue
estimation, QR decomposition and many other functions using simple shift and add operations [8].
CORDIC may not be the fastest method, but it attracts attention because of its simple hardware
implementation: once designed, it can be reused in all of the previously mentioned applications. In
this section, the basic principles of the CORDIC algorithm and its modes are discussed. The main
concepts used in the CORDIC algorithm are (i) to decompose a rotation into a sequence of elementary
rotations by predefined angles that can be implemented with minimum hardware resource utilization and
(ii) to avoid arithmetic operations for scaling, such as square root and division, by exploiting the
fact that the scale factor contains only magnitude information and no information about the rotation
angle. There are two basic operating modes, namely vectoring mode and rotation mode.
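
The two concepts can be made concrete with the standard factorization of a plane rotation, sketched here as background rather than quoted from the references. Restricting each elementary angle to \tan\theta_i = 2^{-i} turns the matrix entries into shifts, and the cosine factors collect into a single constant K that depends only on the number of iterations n.

```latex
\begin{aligned}
\begin{bmatrix} x' \\ y' \end{bmatrix}
&= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
   \begin{bmatrix} x \\ y \end{bmatrix}
 = \cos\theta \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix}
   \begin{bmatrix} x \\ y \end{bmatrix}, \\
\theta &\approx \sum_{i=0}^{n-1} d_i \theta_i, \qquad \tan\theta_i = 2^{-i}, \qquad d_i \in \{+1, -1\}, \\
K &= \prod_{i=0}^{n-1} \cos\theta_i = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}}.
\end{aligned}
```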


3.1. Vectoring Mode 

In vectoring mode, the CORDIC unit rotates the input vector by whatever angle is required to align
the result vector with the x-axis. The outputs of the vectoring mode are the rotation angle and the
scaled magnitude of the input vector, i.e., the x-component. The y-component of the input vector is
reduced at each micro-rotation, and the sign of the y-component determines the direction of the next
rotation. In vectoring mode, the equations are written as [9], [3]:


x_{i+1} = x_i - y_i d_i 2^{-i}    (2)

y_{i+1} = y_i + x_i d_i 2^{-i}    (3)

z_{i+1} = z_i - d_i \tan^{-1}(2^{-i})    (4)

where d_i = +1 if y_i < 0, -1 otherwise.

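
A minimal floating-point sketch of the vectoring iteration follows. It assumes the conventional update x_{i+1} = x_i - y_i d_i 2^{-i}, y_{i+1} = y_i + x_i d_i 2^{-i}, z_{i+1} = z_i - d_i atan(2^{-i}) with d_i = +1 when y_i < 0 and -1 otherwise, a starting angle of zero, and an input with positive x-component. The function name, the default of 13 iterations and the on-the-fly computation of the scale constant are choices of this sketch, not details of the hardware design.

```python
import math

def cordic_vectoring(x, y, n_iter=13):
    """Drive y toward zero; return (magnitude after scale correction, accumulated angle)."""
    z = 0.0
    for i in range(n_iter):
        d = 1.0 if y < 0 else -1.0               # direction from the sign of y_i
        # Tuple assignment evaluates the right-hand side with the old x, y, z
        x, y, z = (x - y * d * 2.0 ** -i,        # x update
                   y + x * d * 2.0 ** -i,        # y update
                   z - d * math.atan(2.0 ** -i)) # angle accumulator
    # Scale constant K = prod(1 / sqrt(1 + 2^-2i)) removes the CORDIC gain
    k = math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n_iter))
    return x * k, z

mag, angle = cordic_vectoring(3.0, 4.0)
```

For the input (3, 4) the returned magnitude is close to 5 and the angle is close to atan(4/3), within the resolution set by the last micro-rotation.
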
3.2. Rotation Mode 

In rotation mode, the CORDIC unit rotates a given input vector by a specified angle. First, the
angle accumulator is initialized with the desired rotation angle. Each micro-rotation reduces the
magnitude of the residual angle, and the direction of the next rotation is decided by the sign of the
residual angle after each step. In rotation mode, the equations are given by


x_{i+1} = x_i - y_i d_i 2^{-i}    (5)

y_{i+1} = y_i + x_i d_i 2^{-i}    (6)

z_{i+1} = z_i - d_i \tan^{-1}(2^{-i})    (7)

where d_i = -1 if z_i < 0, +1 otherwise.

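
The rotation iteration can be sketched in the same way. The sketch assumes the conventional direction rule for this mode, d_i = -1 when the residual angle z_i is negative and +1 otherwise, together with the same x, y and z updates as in vectoring mode. The function name, the 13-iteration default and the floating-point arithmetic are illustrative assumptions.

```python
import math

def cordic_rotation(x, y, angle, n_iter=13):
    """Rotate (x, y) counter-clockwise by `angle` (radians, roughly |angle| < 1.74)."""
    z = angle
    for i in range(n_iter):
        d = -1.0 if z < 0 else 1.0               # direction from the sign of the residual angle
        x, y, z = (x - y * d * 2.0 ** -i,
                   y + x * d * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    # Both outputs are corrected by the same scale constant K
    k = math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n_iter))
    return x * k, y * k

xr, yr = cordic_rotation(1.0, 0.0, math.pi / 6)
```

Rotating the unit vector (1, 0) by 30 degrees gives values close to (cos 30°, sin 30°); a negative angle produces a clockwise rotation.
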
3.3. Pipelined CORDIC unit 

The vectoring CORDIC unit is pipelined so that the rotated vector ends up close to the nulling axis.
The number of iterations, i.e. pipeline stages, needed for sufficient accuracy was found to be 13, both
by manual calculation and by Zahid Khan in [10]. The x-component, i.e. the output of the vectoring
mode, is scaled by the constant K=0.6057 [11]. Similarly, the rotation CORDIC unit is pipelined to
obtain an accurate rotation of the x-component and the y-component, and both are scaled by the scaling
constant K.
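
How the stage count relates to accuracy and to the scale constant can be checked numerically. The sketch below assumes micro-rotation angles atan(2^-i) for i = 0 to n-1 and ideal arithmetic; the helper names are hypothetical. Under these assumptions the product evaluates to about 0.6073 for 13 stages, slightly different from the K = 0.6057 quoted above, so the constant used in hardware is worth checking against the chosen stage count and word length.

```python
import math

def cordic_scale_constant(n_stages):
    """K = product of cos(atan(2^-i)) = product of 1/sqrt(1 + 2^-2i) over all stages."""
    return math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n_stages))

def residual_angle_bound(n_stages):
    """Largest angle error left after the last micro-rotation, in radians."""
    return math.atan(2.0 ** -(n_stages - 1))

K13 = cordic_scale_constant(13)
bound13 = residual_angle_bound(13)
```

With 13 stages the residual angle bound is about 2.4e-4 rad, i.e. roughly 12 bits of angular resolution.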


4. PROPOSED QR DECOMPOSITION MODULE 
The Givens rotation of a 2x2 real-valued channel matrix is given by Huang in [12] as

\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}
=
\begin{bmatrix} \sqrt{h_{11}^2 + h_{21}^2} & h'_{12} \\ 0 & h'_{22} \end{bmatrix}    (8)

where \theta = \tan^{-1}(h_{21}/h_{11}) and h'_{12}, h'_{22} denote the rotated second-column elements.

Equation 8 can be implemented using a pipelined and scaled vectoring CORDIC unit followed by a
pipelined and scaled rotation CORDIC unit. The CORDIC vectoring unit aligns the first column of the
real 2x2 channel matrix with the x-axis, nulling h_{21}, and calculates the rotation angle θ, which is
used to rotate the second column of the matrix. The resultant first and second column elements form
the upper triangular matrix R. The rotation angle θ is also used to rotate the identity matrix to
generate the Q matrix. Figure 2 shows the block diagram of the proposed QRD module.


The second column of the channel matrix and the two columns of the identity matrix are fed
consecutively using buffer and reshaping units. Similarly, the outputs for the second column and the
two columns of the Q matrix are collected by a buffer unit. The proposed QRD module eliminates
division, squaring and square-root operations, which leads to area optimization and lower power
consumption in hardware implementation.
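
The data flow of the module can be sketched in floating point as one vectoring pass followed by three rotation passes (second column of H, then the two identity columns). Several points are assumptions of this sketch rather than statements about the hardware: the function names, the 13 micro-rotations, the sign conventions of the two modes (which make the rotation unit take the negated vectoring angle), a positive h11, and the final transpose, which is needed here because rotating the identity yields the Givens matrix G with G H = R, so that Q = G^T.

```python
import math

N_STAGES = 13
ANGLES = [math.atan(2.0 ** -i) for i in range(N_STAGES)]
K = math.prod(math.cos(a) for a in ANGLES)   # CORDIC scale constant for 13 stages

def vectoring(x, y):
    """Null y; return (scaled magnitude, angle of the input vector)."""
    z = 0.0
    for i, a in enumerate(ANGLES):
        d = 1.0 if y < 0 else -1.0
        x, y, z = x - y * d * 2.0 ** -i, y + x * d * 2.0 ** -i, z - d * a
    return x * K, z

def rotation(x, y, z):
    """Rotate (x, y) counter-clockwise by z; return the scaled result."""
    for i, a in enumerate(ANGLES):
        d = -1.0 if z < 0 else 1.0
        x, y, z = x - y * d * 2.0 ** -i, y + x * d * 2.0 ** -i, z - d * a
    return x * K, y * K

def qrd_2x2(h):
    """QR decomposition of a real 2x2 matrix given as [[h11, h12], [h21, h22]]."""
    (h11, h12), (h21, h22) = h
    r11, theta = vectoring(h11, h21)          # first column -> (r11, 0)
    r12, r22 = rotation(h12, h22, -theta)     # same rotation applied to the second column
    g11, g21 = rotation(1.0, 0.0, -theta)     # first column of G (rotated identity)
    g12, g22 = rotation(0.0, 1.0, -theta)     # second column of G
    r = [[r11, r12], [0.0, r22]]
    q = [[g11, g21], [g12, g22]]              # Q = G^T, so that H = Q R
    return q, r

H = [[3.0, 1.0], [4.0, 2.0]]
Q, R = qrd_2x2(H)
QR = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
```

For this example R comes out close to [[5, 2.2], [0, 0.4]], and the product Q R reproduces H up to the CORDIC angle resolution. No division, squaring or square root appears in the data path; the only multiplications are by powers of two and by the constant K.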


Figure 2. Proposed QRD module for real 2x2 matrix (blocks: CORDIC vectoring mode, CORDIC rotation mode)


5. IMPLEMENTATION AND RESULTS 

The proposed QRD module for MIMO detection is implemented using Xilinx System Generator.
System Generator is a DSP design tool that uses the MathWorks Simulink design environment [13] for
FPGA design. The Xilinx blocksets are IP cores used to design the required modules. The blocks are
polymorphic, and FPGA implementation steps such as synthesis and place and route are performed
automatically to generate the programming file. Figure 3 shows the implementation of the CORDIC
vectoring unit using Xilinx System Generator.

The vectoring unit implements equations 2, 3 and 4; the angle increments in (4) are stored in a
Block RAM from the Xilinx block set. For each iteration, the angle increment is retrieved from the
Block RAM by supplying its address. The vectoring unit takes one clock cycle (cc) to generate its
output, so the latency of the unit is 1cc. Figure 4 shows the CORDIC rotation unit. This unit
implements equations 5, 6 and 7 and also has a latency of 1cc. Figure 5 depicts the pipelined and
scaled rotation CORDIC unit for 13 iterations, and Figure 6 shows the scale correction unit for
K=0.6057 using simple shift and addition operations.
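
The stored angle increments and a shift-and-add scale correction can be sketched as follows. The fixed-point format (16 fractional bits) and the particular shift pattern 2^-1 + 2^-3 - 2^-6 are hypothetical choices made for illustration. The text does not give the word length or the decomposition used in the design; the pattern here was picked only because it needs two add/subtract operations.

```python
import math

FRAC_BITS = 16    # assumed number of fractional bits (not stated in the text)
N_STAGES = 13

# Angle look-up memory contents: atan(2^-i) quantized to fixed point, one word per stage
ANGLE_ROM = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(N_STAGES)]

def scale_correct(x):
    """Multiply an integer sample by about 0.609 using shifts and two add/subtract steps.

    Hypothetical decomposition: 2^-1 + 2^-3 - 2^-6 = 0.609375.
    """
    return (x >> 1) + (x >> 3) - (x >> 6)

one = 1 << FRAC_BITS                  # fixed-point representation of 1.0
approx_k = scale_correct(one) / one   # effective multiplier of the shift-add network
```

Adding further shifted terms brings the effective multiplier closer to the target constant at the cost of one adder per term.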


Figure 3. Implementation of Vectoring mode

Figure 4. Implementation of Rotation mode

Block labels: AddSub2, Constant, ROM

Figure 6. Scale Correction unit


Figure 7 shows the timing diagram of the QRD module. The vectoring CORDIC produces the real-valued
output and the rotation angle in 13 clock cycles; the rotation CORDIC then uses the angle to rotate
the given input in the next 13 clock cycles. Therefore, the overall latency of the QRD module is 26
clock cycles for the QRD of a 2x2 real channel matrix. The computational complexity of the proposed
QRD module is tabulated in Table 1. The proposed QR decomposition requires 84 additions for a 2x2 real
channel matrix and 168 additions for a 2x2 complex channel matrix. Each CORDIC stage performs 3
additions, so the 13-stage pipelined vectoring CORDIC unit performs 39 additions; with 2 additions for
its scale correction unit, the total is 41. The pipelined rotation unit performs 39 additions for its
13 stages and 4 additions for its 2 scale correction units, 43 in total. Hence, the proposed design
for a 2x2 complex matrix requires 16% fewer additions than the Givens rotation method by Hwang in
[14]. The 2x2 complex channel matrix is implemented using two proposed QRD modules out of which two
modules will operate in a parallel manner. Therefore, the latency for a 2x2 complex matrix is 52 clock
cycles.

As the proposed design is scalable, the QRD module can be used to decompose a 4x4 channel matrix.
Four QRD modules are required for a 4x4 real channel matrix, and it takes 104 clock cycles to obtain
an upper triangular matrix, which is 42% less than in the literature [15]. The decomposition requires
336 addition operations, which is 9% fewer than in [14].
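
The operation counts and latencies quoted here can be reproduced with a few lines of arithmetic. The sketch takes the per-stage and per-unit counts from the text and the reference figures (200 and 368 additions) from Table 1; the variable names and the helper function are illustrative.

```python
STAGES = 13            # pipeline stages per CORDIC unit
ADDS_PER_STAGE = 3     # additions per micro-rotation

vectoring_adds = STAGES * ADDS_PER_STAGE + 2      # plus one scale correction unit (2 adds)
rotation_adds = STAGES * ADDS_PER_STAGE + 2 * 2   # plus two scale correction units
real_2x2_adds = vectoring_adds + rotation_adds

latency_real_2x2 = STAGES + STAGES                # vectoring pass, then rotation pass
latency_complex_2x2 = 2 * latency_real_2x2
latency_real_4x4 = 4 * latency_real_2x2

def reduction_percent(proposed, reference):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

complex_2x2_saving = reduction_percent(168, 200)  # proposed complex QRD vs. [14]
real_4x4_saving = reduction_percent(336, 368)     # proposed 4x4 real QRD vs. [14]
```

The results are 41, 43 and 84 additions, latencies of 26, 52 and 104 clock cycles, and savings of 16% and about 8.7% (rounded to 9% in the text).
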
Figure 7. Timing Diagram


A CORDIC based QR Decomposition Technique for MIMO Detection (Shirly Edward.A)

ISSN: 2088-8708


Table 1. Computational Complexity Comparison

Scheme                          Matrix dimension    Addition operations
Proposed real QRD               2x2                 84 (43+41)
Proposed complex QRD            2x2                 168
Complex Givens rotation [14]    2x2                 200
Proposed real QRD               4x4                 336
Real Givens rotation [14]       4x4                 368


Table 2 presents the synthesis report of the QRD module for the 2x2 real channel matrix and the 2x2
complex channel matrix. The design was targeted to a Xilinx Virtex 5 FPGA. The minimum clock period
achieved by the design is 2 ns, i.e. 1cc. Therefore, the latency of the module is 52 ns for the 2x2
real matrix and 104 ns for the 2x2 complex matrix. The number of LUTs used, the number of occupied
slices, the number of Block RAMs used and the on-chip power consumed by the design at a 100 MHz FPGA
clock frequency are listed in the table. The on-chip power consumption of the module is taken from the
XPower analysis report.


Table 2. Synthesis Report for 2x2 Matrix

Device: Xilinx Virtex 5 xc5vsx240t-2ff1738

Resources                         2x2 real matrix    2x2 complex matrix
No. of LUTs used                  2112               8812
No. of occupied slices            586                2376
No. of Block RAMs used            18                 53
On-chip power (mW) at 100 MHz     12                 44
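
The latency figures in nanoseconds and the growth in resources from the real to the complex design follow directly from the numbers reported above. The short sketch below only restates that arithmetic; the dictionary layout and names are illustrative.

```python
CLOCK_PERIOD_NS = 2.0   # minimum clock period reported for the design

latency_cycles = {"2x2 real": 26, "2x2 complex": 52}
latency_ns = {name: cycles * CLOCK_PERIOD_NS for name, cycles in latency_cycles.items()}

max_clock_mhz = 1e3 / CLOCK_PERIOD_NS   # clock rate corresponding to a 2 ns period

# (real, complex) resource figures copied from Table 2
resources = {
    "LUTs": (2112, 8812),
    "slices": (586, 2376),
    "block_rams": (18, 53),
    "power_mw": (12, 44),
}
complex_to_real_ratio = {name: cplx / real for name, (real, cplx) in resources.items()}
```

The ratios show that the complex design uses roughly four times the LUTs and slices of the real design, and about three times the Block RAMs.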